1. Introduction

The purpose of this project is to amplify the presence of shared information in the event of a medical emergency by providing an estimated time of arrival for emergency medical services (EMS).

As today’s medical heroes work diligently to preserve lives when people are in danger, they work in imperfect urban-rural landscape in the context of larger socio-economic and environmental conditions. The emergency medical systems faces call surges, capacity overloads, and inclement weather while trying to meet the complex, diverse, and unpredictable needs of people in a life-safety crisis in the most efficient manner possible. During the COVID-19 Pandemic, first responders are reaching our for help in addressing these challenges.1

This project develops an algorithm to predict EMS response time – the time difference between a user to calling 911 and for assistance to arrive on-scene. It uses publicly available data from the city of Virginia Beach’s Open Data Portal, EMS Calls for Service.2

Knowing the potential response time for an ambulance is important because it allows those in need to assess their most efficient ways of receiving care in the options that best suits the user. For example, if a user has an ankle sprain, and there’s a 11 minute estimated time of arrival for an ambulance to arrive at their location, but only 15 minutes if a user received a ride from a partner, they are now in a better position to assess the cost/benefit of calling an ambulance for that 4-minute difference.

The project proposes to use this information in a new mobile-phone application called MyEMS. The app allows users to select their medical emergency, visualize all the different ways they can reach care and then request an ambulance or receive driving directions to the nearest hospital. They will also be able to immediately get connected with a medical staff via telemedicine while they wait for care.

You can also watch our presentation about the project here: https://www.youtube.com/watch?v=e-DnK3K7Ehk&feature=youtu.be

2. Data

datasource_table <-data.frame(
  Datasets = c("Virginia Beach EMS calls data", "Tract level U.S. Census demographic data", "Weather data", "Tract level chronic disease and health outcome data"),
  Source = c("Provided by class", "U.S. Census Bureau", "Riem package which gather weather data from \n Automated Surface Observing System (ASOS) stations (airports)", "Center for Disease Control and Prevention")
)

kbl(datasource_table, caption ="<b>Datasets used<b>") %>%
  kable_paper(c("hover", "condensed"), full_width = T, html_font = "Avenir") %>%
  kable_styling (bootstrap_options = "striped", "condensed", font_size = 10) %>%
  row_spec(0, bold = T, color = "white", background = "#2e4057", font_size = 14) %>%
  column_spec(1, bold = T)
Datasets used
Datasets Source
Virginia Beach EMS calls data Provided by class
Tract level U.S. Census demographic data U.S. Census Bureau
Weather data Riem package which gather weather data from Automated Surface Observing System (ASOS) stations (airports)
Tract level chronic disease and health outcome data Center for Disease Control and Prevention

Our main dataset was provided by the class, which has the time and other characteristics of the EMS calls in Virginia Beach during January 2017 through February 2018. In the following section, we will discuss further about the feature engineering process using the data from this dataset. Additionally, we have used tract level U.S. Census demographic data from the U.S. Census Bureau, weather data at the time of the calls using the reim package, and tract level chronic disease and health outcome data from the Center for Disease Control and Prevention. We have selected these datasets to better understand the what factors may contribute to the response time of an EMS call.

2.1. Load main EMS data

# load main dataset and make it an SF object
main_ems <- read.csv("Main_VaBeach_EMS_2017_18.csv")
main_ems.sf <- main_ems %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('EPSG:6595')

The main EMS dataset (main_ems.sf) includes a various temporal and other characteristics of the individual EMS calls. Using the temporal features in the dataset, we have engineered the following list of features including ResponseTime, our main feature we hope to be able to predict. We have engineered these features to find any factors that may be helpful in predicting the ResponesTime of the EMS calls. We have also created CallVolume feature, which aggregates the number of calls during the 3 hour periods where the individual calls falls under. For example, CallVolume value for an EMS call at 2pm on January 21, 2017, will be the total number of EMS calls during the period of 12pm to 3pm on January 21, 2017.

timefeatures_table <-data.frame(
  Features = c("ResponseTime", "time_of_day", "times", "dotw", "weekend", "CallVolume", "season", "holiday"),
  Description = c("Time from the initial call to EMS arriving at scene", "6 hour periods of a day (Overnight, Morning, Afternoon, Evening)", "3 hour periods of a day (one-three, three-six, six-nine, nine-twelve, twelve-fifteen, fifteen-eighteen, eighteen-twentyone,twentyone-twentyfour)", "Day of the week", "weekend", "Number of calls during the 3 hour period", "Spring, Summer, Fall, or Winter", "Includes major holidays - New Years Day, Martin Luther King Jr. Day, Memorial Day, Independence Day, Labor Day, Thanksgiving Day, and Christmas Day"),
  Type = c("Continous","Categorical", "Categorical", "Categorical", "Binary", "Continous", "Binary", "Binary")
)

kbl(timefeatures_table, caption ="<b>Engineered temporal features<b>") %>%
  kable_paper(c("hover", "condensed"), full_width = T, html_font = "Avenir") %>%
  kable_styling (bootstrap_options = "striped", "condensed", font_size = 10) %>%
  row_spec(0, bold = T, color = "white", background = "#2e4057", font_size = 14) %>%
  column_spec(1, bold = T)
Engineered temporal features
Features Description Type
ResponseTime Time from the initial call to EMS arriving at scene Continous
time_of_day 6 hour periods of a day (Overnight, Morning, Afternoon, Evening) Categorical
times 3 hour periods of a day (one-three, three-six, six-nine, nine-twelve, twelve-fifteen, fifteen-eighteen, eighteen-twentyone,twentyone-twentyfour) Categorical
dotw Day of the week Categorical
weekend weekend Binary
CallVolume Number of calls during the 3 hour period Continous
season Spring, Summer, Fall, or Winter Binary
holiday Includes major holidays - New Years Day, Martin Luther King Jr. Day, Memorial Day, Independence Day, Labor Day, Thanksgiving Day, and Christmas Day Binary
# create time bins (30 min and 60 min)
# create days and weeks
# create ResponseTime from "CallDateandTime" to "OnSceneDateandTime" 
# create "time_of_day" (6 hour periods)
# create "times" (3 hour periods)
main_ems.sf <- main_ems.sf %>%
  mutate(interval60 = floor_date(mdy_hm(CallDateandTime), unit = "60 mins"),
         interval30 = floor_date(mdy_hm(CallDateandTime), unit = "30 mins"),
         week = week(interval60),
         dotw = as.character(wday(interval60, label=TRUE))) %>%
  mutate(ResponseTime =  as.numeric(difftime(mdy_hm(OnSceneDateandTime), mdy_hm(CallDateandTime), units = "min"))) %>%
  mutate(time_of_day = case_when(hour(interval60) >= 0 & hour(interval60) < 6 ~ "Overnight",
                                 hour(interval60) >= 6 & hour(interval60) < 12 ~ "Morning",
                                 hour(interval60) >= 12 & hour(interval60) < 18 ~ "Afternoon",
                                 hour(interval60) >= 18 & hour(interval60) <= 24 ~ "Evening")) %>%
  mutate(times = case_when(hour(interval60) >= 0 & hour(interval60) < 3 ~ "one-three",
                           hour(interval60) >= 3 & hour(interval60) < 6 ~ "three-six",
                           hour(interval60) >= 6 & hour(interval60) < 9 ~ "six-nine",
                           hour(interval60) >= 9 & hour(interval60) < 12 ~ "nine-twelve",
                           hour(interval60) >= 12 & hour(interval60) < 15 ~ "twelve-fifteen",
                           hour(interval60) >= 15 & hour(interval60) < 18 ~ "fifteen-eighteen",
                           hour(interval60) >= 18 & hour(interval60) < 21 ~ "eighteen-twentyone",
                           hour(interval60) >= 21 & hour(interval60) <= 24 ~ "twentyone-twentyfour")) %>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")) %>%
  mutate(season = case_when(week >= 1 & week < 10 ~ "Winter",
                            week >= 10 & week < 23 ~ "Spring",
                            week >= 23 & week < 40 ~ "Summer",
                            week >= 40 & week < 47 ~ "Fall", 
                            week >= 47 & week <= 52 ~ "Winter")) %>%
  mutate(Date = substring(CallDateandTime,1,7)) %>%
  unite(Date_timeofday, c(Date, time_of_day), sep = " ", remove = FALSE) %>%
  unite(Date_time, c(Date, times), sep = " ", remove = FALSE) %>%
  mutate(holiday = ifelse(Date %in% c("1/2/17 ", "1/16/17", "5/29/17", "7/4/17 ", "9/4/17 ", "11/23/1", "12/25/1", "1/1/18 ", "1/15/18"), "Holiday", "Non-Holiday")) %>%
  drop_na()
# create call volume column
vol_count_dat <- main_ems.sf %>%
  st_drop_geometry() %>%
  group_by(Date_time) %>%
  dplyr::summarise(CallVolume = n())

# join the volume data file to main dataframe
main_ems.sf <-
  left_join(main_ems.sf, vol_count_dat, by="Date_time")

Additionally, the main_ems.sf dataset includes a column with RescueSquad.Number. Each EMS calls are coded with the characteristics of the calls and the units that responded to the scene. We were able to contact one of the chiefs of the Virginia Beach EMS stations who was able to provide the description of these rescue squad codes. With that information, we have categorized the rescue squad types into a simpler categories, which you can see from the list below.

RescueSquad_table <-data.frame(
  Features = c("advanced_life_support", "chief_dispatch", "special_event", "air_dispatch", "hold_dispatch", "water_dispatch", "mass_casualty_dispatch", "fire_dispatch"),
  Description = c("Calls requiring advanced life support unit", "Calls requiring an EMS chief","Calls at a special event, like a concert or marathon", "Calls requiring Helicopter Ambulances", "Calls when the dispatcher was waiting on another ambulence", "Marine EMS Calls requiring rescue boat or jetski", "Calls requiring mass causalty unit", "Calls related to fire incidents"),
  Type = c("Binary","Binary", "Binary", "Binary", "Binary", "Binary", "Binary", "Binary")
)

kbl(RescueSquad_table , caption ="<b>Rescue squad types engineered features<b>") %>%
  kable_paper(c("hover", "condensed"), full_width = T, html_font = "Avenir") %>%
  kable_styling (bootstrap_options = "striped", "condensed", font_size = 10) %>%
  row_spec(0, bold = T, color = "white", background = "#2e4057", font_size = 14) %>%
  column_spec(1, bold = T)
Rescue squad types engineered features
Features Description Type
advanced_life_support Calls requiring advanced life support unit Binary
chief_dispatch Calls requiring an EMS chief Binary
special_event Calls at a special event, like a concert or marathon Binary
air_dispatch Calls requiring Helicopter Ambulances Binary
hold_dispatch Calls when the dispatcher was waiting on another ambulence Binary
water_dispatch Marine EMS Calls requiring rescue boat or jetski Binary
mass_casualty_dispatch Calls requiring mass causalty unit Binary
fire_dispatch Calls related to fire incidents Binary
main_ems.sf$advanced_life_support <- ifelse (
  (
    endsWith(main_ems.sf$RescueSquad.Number, 'S') |
      startsWith(main_ems.sf$RescueSquad.Number, 'Z') |
      endsWith(main_ems.sf$RescueSquad.Number, 'P') |
      startsWith(main_ems.sf$RescueSquad.Number, 'MED')
  ), 
  "Advanced Life Support", "Not Advanced Life Support"
)

main_ems.sf$b_advanced_life_support <- ifelse (
  (
    endsWith(main_ems.sf$RescueSquad.Number, 'S') |
      startsWith(main_ems.sf$RescueSquad.Number, 'Z') |
      endsWith(main_ems.sf$RescueSquad.Number, 'P') |
      startsWith(main_ems.sf$RescueSquad.Number, 'MED')
  ), 
  1, 0
)


main_ems.sf$chief_dispatch <- ifelse (
  (
    startsWith(main_ems.sf$RescueSquad.Number, 'ECH') |
      (((startsWith(main_ems.sf$RescueSquad.Number, 'CAR')) == TRUE) & ((startsWith(main_ems.sf$RescueSquad.Number, 'CART')) == FALSE)) |
      startsWith(main_ems.sf$RescueSquad.Number, 'BAT')
  ), 
  "Chief Dispatched", "Cheif not Dispatched"
)

main_ems.sf$b_chief_dispatch <- ifelse (
  (
    startsWith(main_ems.sf$RescueSquad.Number, 'ECH') |
      (((startsWith(main_ems.sf$RescueSquad.Number, 'CAR')) == TRUE) & ((startsWith(main_ems.sf$RescueSquad.Number, 'CART')) == FALSE)) |
      startsWith(main_ems.sf$RescueSquad.Number, 'BAT')
  ), 
  1, 0
)

main_ems.sf$special_event <- ifelse (
  (
    main_ems.sf$RescueSquad.Number == 'BKTEAM' |
      main_ems.sf$RescueSquad.Number == 'EMSOPS' |
      startsWith(main_ems.sf$RescueSquad.Number, 'CART')
  ), 
  "Special Event", "Not Special Event"
)

main_ems.sf$b_special_event <- ifelse (
  (
    main_ems.sf$RescueSquad.Number == 'BKTEAM' |
      main_ems.sf$RescueSquad.Number == 'EMSOPS' |
      startsWith(main_ems.sf$RescueSquad.Number, 'CART')
  ), 
  1, 0
)


main_ems.sf$air_dispatch <- ifelse (
  (
    main_ems.sf$RescueSquad.Number == 'AIRMED'
  ), 
  "Air unit", "Not an Air Unit"
)

main_ems.sf$b_air_dispatch <- ifelse (
  (
    main_ems.sf$RescueSquad.Number == 'AIRMED'
  ), 
  1, 0
)

main_ems.sf$hold_dispatch <- ifelse (
  (
    startsWith(main_ems.sf$RescueSquad.Number, 'HOLD')
  ), 
  "Held By Dispatcher", "Not Held by Dispatcher"
)

main_ems.sf$b_hold_dispatch <- ifelse (
  (
    startsWith(main_ems.sf$RescueSquad.Number, 'HOLD')
  ), 
  1, 0
)


main_ems.sf$water_dispatch <- ifelse (
  (
    startsWith(main_ems.sf$RescueSquad.Number, 'FBOA') |
      startsWith(main_ems.sf$RescueSquad.Number, 'RB') |
      startsWith(main_ems.sf$RescueSquad.Number, 'MRTK') |
      startsWith(main_ems.sf$RescueSquad.Number, 'JTSKI')
    
  ), 
  "Water Unit", "Not a Water Unit"
)

main_ems.sf$b_water_dispatch <- ifelse (
  (
    startsWith(main_ems.sf$RescueSquad.Number, 'FBOA') |
      startsWith(main_ems.sf$RescueSquad.Number, 'RB') |
      startsWith(main_ems.sf$RescueSquad.Number, 'MRTK') |
      startsWith(main_ems.sf$RescueSquad.Number, 'JTSKI')
    
  ), 
  1, 0
)


main_ems.sf$mass_casualty_dispatch <- ifelse (
  (
    startsWith(main_ems.sf$RescueSquad.Number, 'MCI') 
  ), 
  "Mass Casualty Incident", "Not a Mass Casualty Incident"
)

main_ems.sf$b_mass_casualty_dispatch <- ifelse (
  (
    startsWith(main_ems.sf$RescueSquad.Number, 'MCI') 
  ), 
  1, 0
)


main_ems.sf$fire_dispatch <- ifelse (
  (
    startsWith(main_ems.sf$RescueSquad.Number, 'L') |
      (startsWith(main_ems.sf$RescueSquad.Number, 'E') & ((startsWith(main_ems.sf$RescueSquad.Number, 'ECH') == FALSE) & (main_ems.sf$RescueSquad.Number != 'EMSOPS'))) |
      (startsWith(main_ems.sf$RescueSquad.Number, 'T') & (startsWith(main_ems.sf$RescueSquad.Number, 'TAC') == FALSE)) |
      endsWith(main_ems.sf$RescueSquad.Number, 'F') |
      startsWith(main_ems.sf$RescueSquad.Number, 'NE')
  ), 
  "Fire Team Dispactched", "Fire Team not Dispatched"
)

main_ems.sf$b_fire_dispatch <- ifelse (
  (
    startsWith(main_ems.sf$RescueSquad.Number, 'L') |
      (startsWith(main_ems.sf$RescueSquad.Number, 'E') & ((startsWith(main_ems.sf$RescueSquad.Number, 'ECH') == FALSE) & (main_ems.sf$RescueSquad.Number != 'EMSOPS'))) |
      (startsWith(main_ems.sf$RescueSquad.Number, 'T') & (startsWith(main_ems.sf$RescueSquad.Number, 'TAC') == FALSE)) |
      endsWith(main_ems.sf$RescueSquad.Number, 'F') |
      startsWith(main_ems.sf$RescueSquad.Number, 'NE')
  ), 
  1, 0
)

2.2 Load spatial features

We were also interested in the spatial features of the EMS calls. Using the locations of all EMS stations, fire stations, and hospitals in Virginia Beach, we have calculated the distance of the nearest locations of the three locations types through nn_function. Using this engineered feature, we will explore whether the distance to EMS stations, fire stations, or hospitals contribute to the ResponseTime of each calls.

nn_table <-data.frame(
  Features = c("nn_ems_station", "nn_fire_stations", "nn_hospitals"),
  Description = c("distance to nearest EMS station", "distance to nearest fire station", "distance to nearest hosptial"),
  Type = c("Continuous","Continuous", "Continuous")
)

kbl(nn_table , caption ="<b>Eengineered spatial features<b>") %>%
  kable_paper(c("hover", "condensed"), full_width = T, html_font = "Avenir") %>%
  kable_styling (bootstrap_options = "striped", "condensed", font_size = 10) %>%
  row_spec(0, bold = T, color = "white", background = "#2e4057", font_size = 14) %>%
  column_spec(1, bold = T)
Eengineered spatial features
Features Description Type
nn_ems_station distance to nearest EMS station Continuous
nn_fire_stations distance to nearest fire station Continuous
nn_hospitals distance to nearest hosptial Continuous
## Reading layer `ESRIJSON' from data source `https://gismaps.vbgov.com/arcgis/rest/services/Basemaps/Administrative_Boundaries/MapServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json' using driver `ESRIJSON'
## Simple feature collection with 1 feature and 3 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -76.22806 ymin: 36.55038 xmax: -75.86756 ymax: 36.93259
## geographic CRS: WGS 84
## Reading layer `Administrative_Boundaries' from data source `https://opendata.arcgis.com/datasets/82ada480c5344220b2788154955ce5f0_2.geojson' using driver `GeoJSON'
## Simple feature collection with 100 features and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -76.22776 ymin: 36.55045 xmax: -75.7975 ymax: 37.04089
## geographic CRS: WGS 84
## Reading layer `201127_ems_geocoded' from data source `/Users/davidseungleepark/Library/Mobile Documents/com~apple~CloudDocs/Fall 2020/cpln592/Predicting-EMS-calls-in-Virginia-Beach/201127_ems_geocoded.shp' using driver `ESRI Shapefile'
## Simple feature collection with 16 features and 12 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -8481659 ymin: 4381398 xmax: -8454246 ymax: 4426743
## projected CRS:  WGS 84 / Pseudo-Mercator
## Reading layer `Hospitals__Virginia_' from data source `/Users/davidseungleepark/Library/Mobile Documents/com~apple~CloudDocs/Fall 2020/cpln592/Predicting-EMS-calls-in-Virginia-Beach/Hospitals__Virginia_.shp' using driver `ESRI Shapefile'
## Simple feature collection with 169 features and 17 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -83.05103 ymin: 36.57806 xmax: -75.86424 ymax: 39.19399
## geographic CRS: WGS 84
## Reading layer `Fire_Stations' from data source `/Users/davidseungleepark/Library/Mobile Documents/com~apple~CloudDocs/Fall 2020/cpln592/Predicting-EMS-calls-in-Virginia-Beach/Fire_Stations.shp' using driver `ESRI Shapefile'
## Simple feature collection with 52184 features and 24 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -19663710 ymin: 2002414 xmax: -7193591 ymax: 11505290
## projected CRS:  WGS 84 / Pseudo-Mercator

Using the EMS station locations, we also created a voronoi map using the following function. Assuming the assignment of call dispatches depend on the closest EMS stations, the voronoi map of Virginia would allow us to understand if there are specific “EMS station zones” that are seeing either longer or shorter EMS response time.

# create voronoi
bbox_polygon <- function(x) {
  bb <- sf::st_bbox(x)
  
  p <- matrix(
    c(bb["xmin"], bb["ymin"], 
      bb["xmin"], bb["ymax"],
      bb["xmax"], bb["ymax"], 
      bb["xmax"], bb["ymin"], 
      bb["xmin"], bb["ymin"]),
    ncol = 2, byrow = T
  )
  
  sf::st_polygon(list(p))
}

box <- st_sfc(bbox_polygon(vb_boundary))

v <- st_voronoi(st_union(ems_stations), box)

voronoi_polygon <- (st_intersection(st_cast(v), st_union(vb_boundary))) %>%
  st_sf()

vb_voronoi <-st_join(voronoi_polygon,ems_stations, left =TRUE) %>%
  mutate(voronoi_id = OBJECTID) %>%
  dplyr::select(geometry, voronoi_id)


ggplot() + 
  geom_sf(data = vb_voronoi, fill = "#1167b1") +
  geom_sf(data = ems_stations, color="white") +
  labs(title = "Virginia Beach EMS station zones \nusing voronoi function") +
  mapTheme

#Assign Voronoi ID for each EMS call
main_ems.sf <- st_join(main_ems.sf, vb_voronoi, left = TRUE)

2.3 Load census and health data

We have also loaded U.S. Census data including the features that we may think may affect the ResponseTime of the EMS calls. Since we do not have the demographic information of the persons of the EMS calls, the tract level demographic information from the U.S. Census American Community Survey(ACS) is the closest we will get to understand the demographic information of the calls. In the later sections, we will we exploring the relationships with some of these variables and the call response time.

census_vars_table <-data.frame(
  Features = c("Med_Age", "MedHHInc", "PopDens", "pctWhite", "pctBlack", "pctHis", "pctBachelors","pctPoverty", "pctCarCommute", "pctPubCommute"),
  Description = c("Median Age", "Median household income", "Population Density", "Percentage of White population", "Percentage of Black population", "Percentage of Hispanic populaiton", "Percentage of population with at least Bachelor's degree", "Percentage of population below the poverty line", "Percentage of population communiting by car", "Percentage of population commuting by public transit"),
  "Type" = c("continuous", "continuous", "continuous", "continuous", "continuous", "continuous", "continuous", "continuous", "continuous", "continuous"))

kbl(census_vars_table , caption ="<b> U.S. Census features<b>") %>%
  kable_paper(c("hover", "condensed"), full_width = T, html_font = "Avenir") %>%
  kable_styling (bootstrap_options = "striped", "condensed", font_size = 10) %>%
  row_spec(0, bold = T, color = "white", background = "#2e4057", font_size = 14) %>%
  column_spec(1, bold = T)
U.S. Census features
Features Description Type
Med_Age Median Age continuous
MedHHInc Median household income continuous
PopDens Population Density continuous
pctWhite Percentage of White population continuous
pctBlack Percentage of Black population continuous
pctHis Percentage of Hispanic populaiton continuous
pctBachelors Percentage of population with at least Bachelor’s degree continuous
pctPoverty Percentage of population below the poverty line continuous
pctCarCommute Percentage of population communiting by car continuous
pctPubCommute Percentage of population commuting by public transit continuous
# Load census (pop_density, med_age, race, poverty, income, education, commute)
census_api_key("41e1c0d912341017fa6f36a5da061d3b23de335e", overwrite = TRUE)
selected_vars <- c("B01003_001E", # Total Population
                   "B02001_001E", # Estimate!!Total population by race -- ##let's double check that it's okay to use this as long as we justify it
                   "B02001_002E", # People describing themselves as "white alone"
                   "B02001_003E", # People describing themselves as "black" or "african-american" alone
                   "B15001_050E", # Females with bachelors degrees
                   "B15001_009E", # Males with bachelors degrees
                   "B19013_001E", # Median HH income
                   "B25058_001E", # Median rent
                   "B06012_002E", # Total poverty
                   "B08301_001E", # People who have means of transportation to work
                   "B08301_002E", # Total people who commute by car, truck, or van
                   "B08301_010E", # Total people who commute by public transportation"
                   "B03002_012E", # Estimate Total Hispanic or Latino by race
                   "B19326_001E", # Median income in past 12 months (inflation-adjusted)
                   "B07013_001E", # Total households
                   "B08013_001E", # Travel Time to Work
                   "B01002_001E") # Median Age

vb_census <- 
  get_acs(geography = "tract", 
          variables = selected_vars, 
          year=2018, 
          state="VA",
          county = c("Virginia Beach"),
          geometry=T, 
          output="wide") %>%
  st_transform('EPSG:6595') %>%
  rename(TotalPop = B01003_001E,
         Med_Age = B01002_001E,
         Race_TotalPop = B02001_001E, 
         Whites = B02001_002E,
         Blacks = B02001_003E,
         FemaleBachelors = B15001_050E, 
         MaleBachelors = B15001_009E,
         MedHHInc = B19013_001E,
         TotalPoverty = B06012_002E,
         TotalCommute = B08301_001E,
         CarCommute = B08301_002E,
         PubCommute = B08301_010E,
         TotalHispanic = B03002_012E,
         MedInc = B19326_001E,
         TotalHH = B07013_001E)
vb_census <- vb_census %>%
  dplyr::select(-NAME, -starts_with("B0"), -starts_with("B1"), -starts_with("B2")) %>%
  mutate(Area = st_area(vb_census),
         pctWhite = (ifelse(Race_TotalPop > 0, Whites / Race_TotalPop,0))*100,
         pctBlack = (ifelse(Race_TotalPop > 0, Blacks / Race_TotalPop,0))*100,
         pctHis = (ifelse(Race_TotalPop >0, TotalHispanic/Race_TotalPop,0))*100,
         pctBachelors = (ifelse(Race_TotalPop > 0, ((FemaleBachelors + MaleBachelors) / Race_TotalPop),0)) *100,
         pctPoverty = (ifelse(Race_TotalPop > 0, TotalPoverty / Race_TotalPop, 0))*100,
         pctCarCommute = (ifelse(TotalCommute > 0, CarCommute / TotalCommute,0))*100,
         pctPubCommute = (ifelse(TotalCommute > 0, PubCommute / TotalCommute,0))*100,
         year = "2018") %>%
  mutate(MedHHInc = replace_na(MedHHInc, 0),
         pctBachelors= replace_na(pctBachelors,0),
         pctHis= replace_na(pctHis,0),
         pctCarCommute= replace_na(pctCarCommute,0),
         PopDens = (TotalPop/(Area/27878400))) %>%
  dplyr::select(-Whites, -Blacks, -FemaleBachelors, -MaleBachelors, -TotalPoverty, -CarCommute, -PubCommute, -TotalCommute, -TotalHispanic)

Additionally, we also included the chronic disease and health outcome data from the Center for Disease Control and Prevention (CDC). This dataset is a census tract-level estimates for chronic disease risk factors and health outcomes, including cancer, diabetes, obesity, stroke, arthritis, asthma, chronic kidney diseases, etc. Similar to the census demographic data, the limitation of this data set is that the data is on the tract level. However, without the individual characteristics of the EMS calls, this dataset at least allows us to analyze whether if there is chronic diseases and health outcomes impact the response time of the EMS calls in Virginia Beach.

health_dat <- read.csv('health_data_500_cities_vabch.csv') %>% 
  dplyr::select(TractFIPS, ends_with('CrudePrev')) %>% 
  rename(GEOID = TractFIPS)
vb_health <- merge(health_dat, vb_census,
                   by.x = "GEOID", by.y = "GEOID",
                   all.x = FALSE, all.y = TRUE,
                   sort = FALSE) %>% 
  dplyr::select(GEOID, ends_with('CrudePrev'), geometry) %>% 
  st_sf()


main_ems.sf <-st_join(main_ems.sf,vb_census, left =TRUE) %>%
  mutate(TotalPop = TotalPop,
         MedHHInc = MedHHInc,
         TotalHH = TotalHH,
         pctWhite = pctWhite,
         pctBlack = pctWhite,
         pctHis= pctHis,
         pctBachelors= pctBachelors,
         pctPoverty = pctPoverty,
         pctCarCommute= pctCarCommute,
         pctPubCommute= pctPubCommute,
         PopDens= PopDens)

#you can find variable names here: https://www.cdc.gov/places/about/500-cities-2016-2019/index.html
main_ems.sf <-st_join(main_ems.sf,vb_health, left =TRUE) %>%
  mutate(
    ACCESS2_CrudePrev = ACCESS2_CrudePrev,
    ARTHRITIS_CrudePrev = ARTHRITIS_CrudePrev,
    BINGE_CrudePrev = BINGE_CrudePrev,
    BPHIGH_CrudePrev = BPHIGH_CrudePrev,
    BPMED_CrudePrev = BPMED_CrudePrev,
    CANCER_CrudePrev = CANCER_CrudePrev,
    CASTHMA_CrudePrev = CASTHMA_CrudePrev,
    CHD_CrudePrev = CHD_CrudePrev,
    CHECKUP_CrudePrev = CHECKUP_CrudePrev,
    CHOLSCREEN_CrudePrev = CHOLSCREEN_CrudePrev,
    COLON_SCREEN_CrudePrev = COLON_SCREEN_CrudePrev,
    COPD_CrudePrev = COPD_CrudePrev,
    COREM_CrudePrev = COREM_CrudePrev,
    COREW_CrudePrev = COREW_CrudePrev,
    CSMOKING_CrudePrev = CSMOKING_CrudePrev,
    DENTAL_CrudePrev = DENTAL_CrudePrev,
    DIABETES_CrudePrev = DIABETES_CrudePrev,
    HIGHCHOL_CrudePrev = HIGHCHOL_CrudePrev,
    KIDNEY_CrudePrev = KIDNEY_CrudePrev,
    LPA_CrudePrev = LPA_CrudePrev,
    MAMMOUSE_CrudePrev = MAMMOUSE_CrudePrev,
    MHLTH_CrudePrev = MHLTH_CrudePrev,
    OBESITY_CrudePrev = OBESITY_CrudePrev,
    PAPTEST_CrudePrev = PAPTEST_CrudePrev,
    PHLTH_CrudePrev = PHLTH_CrudePrev,
    SLEEP_CrudePrev = SLEEP_CrudePrev,
    STROKE_CrudePrev = STROKE_CrudePrev,
    TEETHLOST_CrudePrev = TEETHLOST_CrudePrev
  )

2.4 Load weather data

Using the reim package, we also included the hourly weather data. Hourly Temperature, Precipitation, Wind Speed, and Humidity data in Virginia Beach were matched to EMS calls. We have also created three additional binary features SnowPresent, HeavyRain, and StrongWind to see if this will make the significance of the relationship with call response time better. SnowPresent feature was created by filtering hours when the temperature was below 42F and Precipitation is above 0. HeavyRain feature was created by filtering hours when the Precipitation was above 0.5 inches. StrongWind feature was created by filtering hours when the the average hourly Wind_Speed was above 30mph.

# Load weather data and create features
vb_weather <- 
  riem_measures(station = "NTU", date_start = "2017-01-01", date_end = "2018-03-01") %>%
  dplyr::select(valid, tmpf, p01i, sknt, relh)%>%
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  group_by(interval60) %>%
  dplyr::summarise(Temperature = max(tmpf),
            Precipitation = sum(p01i),
            Wind_Speed = max(sknt),
            Humidity = max(relh)) %>%
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature)) %>%
  mutate(SnowPresent = ifelse(Precipitation > 0.0 & Temperature < 32.0, "Snow", "NoSnow"),
         HeavyRain = ifelse(Precipitation > 0.5, "HeavyRain", "NoHeavyRain"),
         StrongWind = ifelse(Wind_Speed > 30, "StrongWind","NoStrongWind"))

# join weather data to the main dataframe
main_ems.sf <-
  left_join(main_ems.sf, vb_weather, by="interval60")

##GET RID OF NAs
main_ems.sf <- main_ems.sf %>%
  drop_na()

3. Exploratory Analysis

Now that we have loaded all of the necessary datasets, we will be visualizing the selected features’ relationship with the EMS call response time. In our analyses, we will be looking at both spatial and temporal relationships.

3.1 Spatial distribution of EMS calls

First, we wanted to understand where the EMS calls are coming from. We created the fishnet of from the Virginia Beach boundary shapefile and aggregated the EMS calls for each grid. The following map shows that areas in the northern part of the city sees the most EMS calls. Thsi distribution makes sense as most developments congregate around the nothern part of the city boundary.

# create fishnet
vb_fishnet <- 
  st_make_grid(vb_boundary,
               cellsize = 1000, 
               square = FALSE) %>%
  .[vb_boundary] %>% 
  st_sf() %>%
  mutate(uniqueID = rownames(.))

# create EMS calls on fishnet
vb_ems_fishnet <- 
  dplyr::select(main_ems.sf) %>%    
  mutate(countEMS = 1,
  ) %>%    
  aggregate(., vb_fishnet, sum) %>%    
  mutate(countEMS = replace_na(countEMS, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(vb_fishnet) / 24),          
                       size=nrow(vb_fishnet), replace = TRUE))

# map fishnet of EMS calls in VB
ggplot() +
  geom_sf(data = vb_ems_fishnet, aes(fill = countEMS), color = NA) +
  scale_fill_viridis() +
  labs(title = "Fishnet of EMS calls in Virginia Beach") +
  mapTheme

3.2 Response time of EMS calls

Now that we have a better sense of where the EMS calls are coming from, we wanted to better understand the ResponseTime of the EMS calls. The Histogram below shows that the most calls are under 20 minutes while there is a tail of some calls that are much longer. The yellow dotted line in the histogram represents the average of ResponseTime of all calls in the data set is 8.72 which can be translated to 8 minutes and 42 seconds.

# histogram of the ResponseTime
ggplot(main_ems.sf, aes(x=ResponseTime)) +
    geom_histogram(binwidth=1, fill = "darkblue") +
    xlim(0, 50) +
    labs(title="Histogram of all Response Time of EMS calls",
         x="ResponseTime", 
         y="Count") +
    geom_vline(xintercept = 8.72, linetype="dotted", 
                color = "orange", size=1.5) +
    plotTheme

# mapping all calls response time 
ResponseTime_net <- 
  main_ems.sf %>%
  dplyr::select(ResponseTime, dotw, time_of_day) %>%
  mutate(ResponseTime = as.numeric(ResponseTime)) %>% 
  dplyr::select(ResponseTime) %>%
  aggregate(., vb_fishnet, mean) %>%
  mutate(ResponseTime = replace_na(ResponseTime, 0))

Then, what does the ResponseTime across space? Using the same fishent, we have calculated the average response time of all calls within the fishnet grids. The lighter colors represent a higher average response time of all the calls within the fishnet grid. You can see from the map below that there are parts in the northern part of the city, where dead end streets along the rivers and bays experience a higher average response time. You can also see that some parts in the southern part of the city, which is more rural compared to the more developed northern part of the city, also experience a higher response time. This may be the result of the location of the calls being distant from from EMS stations (depicted in white dot).

#map of response time
ggplot() + 
    geom_sf(data = vb_boundary, fill = "black") +
    geom_sf(data = ResponseTime_net %>%
              dplyr::filter(ResponseTime > 0), color = NA, aes(fill = ResponseTime)) +
    scale_fill_viridis_c(option="plasma") +
    geom_sf(data = ems_stations, color="white") +
    labs(title= "Response Time of EMS calls by fishnet") +
    mapTheme

How does response time differ by the EMS stations? Since we do not have th information of where the EMS dispatches are coming from, the best alternative is using the EMS station voronoi map we created earlier, assuming most EMS dispatches occur by the distance to the EMS station. The following map shows the average response time of the EMS calls byt he EMS station zones. This map allows us to see which EMS stations are resulting in longer response time.

voronoi_ResponseTime <- 
  main_ems.sf %>%
  dplyr::select(ResponseTime) %>%
  mutate(ResponseTime = as.numeric(ResponseTime)) %>% 
  dplyr::select(ResponseTime) %>%
  aggregate(., vb_voronoi, mean) %>%
  mutate(ResponseTime = replace_na(ResponseTime, 0))


#map of response time
ggplot() + 
    geom_sf(data = voronoi_ResponseTime %>%
              dplyr::filter(ResponseTime > 0), color = NA, aes(fill = ResponseTime)) +
    scale_fill_viridis_c(option="plasma") +
    geom_sf(data = ems_stations, color="white") +
    labs(title= "Avg Response Time of EMS calls by EMS Station Zones") +
    mapTheme

3.3. Response Time by Call Volume

One of the factors that we initially thought would be impactful in determining response time of an EMS call was the EMS call volume at the time of the call. As described in the previous section CallVolume feature was created by aggregating the number of calls during the 3 hour periods where the individual calls falls under, which then were assigned to individual calls. For example, CallVolume value for an EMS call at 2pm on January 21, 2017, will be the total number of EMS calls during the period of 12pm to 3pm on January 21, 2017.

What we see from the scatterplot below is that CallVolume doesn’t seem to be a strong determinant of the EMS call response time. We initially suspected that the more calls there are during the time a particular EMS call is called, the longer the response time would be. However, this shows that there are many other factors that may be influencing the the response time of the EMS calls.

# plot call volume
ggplot(main_ems.sf, aes(x=CallVolume, y=ResponseTime)) +
  geom_point(size = .75, colour = "darkblue") +
  labs(title= "Average Response Time by Call Volume") +
  plotTheme

3.4 Temporal features

We’ve also explored the average ResponseTime across dotw which represents days of the week and time_of_day representing 6 hour time periods of a day (Overnight, Morning, Afternoon, Evening). The char below shows that on average, response times during weekday afternoon tend to have longer response time while weekday evening calls tend to have a shorter response time. On the other hand, weekend afternoon calls have shorter response time. The second chart shows the average response time by weekday and weekends. While the difference isn’t not drastic, the weekday calls’ response time tend to be longer.

# map Response Time of EMS calls by days of the week and time of the day
ggplot(data = main_ems.sf %>%
         group_by(dotw, time_of_day) %>%
         dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
  geom_bar(aes(x=meanResponeTime, y=dotw, fill = time_of_day), stat="identity", position=position_dodge())+
  scale_fill_manual(values = palette5) +
  labs(title="Response Time of EMS calls by days of the week and time of the day",
       x="ResponseTime", 
       y="Day of the week")+
  plotTheme

ggplot(data = main_ems.sf %>%
           group_by(weekend) %>%
           drop_na(weekend) %>%
           dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
    geom_bar(aes(x=weekend, y=meanResponeTime, fill=weekend), stat="identity", position=position_dodge(), show.legend = FALSE, width=0.5) +
    scale_fill_manual(values = palette2) + 
    labs(title="Response Time of EMS calls by weekday or weekend",
       x="Weekday or weekend", 
       y="ResponseTime") +
    plotTheme

Additionally, we also wanted to explore the temporal characteristics of the EMS calls’ response time in spatial context. The animated maps below shows the response time by days of the week and time of the day.

We’ve also looked at the average response time which season the calls happened, and whether the calls were during major holidays, including New Years Day, Martin Luther King Jr. Day, Memorial Day, Independence Day, Labor Day, Thanksgiving Day, and Christmas Day. You can see that calls during Winter months tend to have a longer response time. Also, the second chart shows that calls during holidays tends to be shorter.

grid.arrange(
  ggplot(data = main_ems.sf %>%
           group_by(season) %>%
           drop_na(season) %>%
           dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
    geom_bar(aes(x=season, y=meanResponeTime, fill=season), stat="identity", position=position_dodge(), show.legend = FALSE, width=0.5) +
    scale_fill_manual(values = palette5) + 
    gridTheme, 
    ggplot(data = main_ems.sf %>%
           group_by(holiday) %>%
           drop_na(holiday) %>%
           dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
    geom_bar(aes(x=holiday, y=meanResponeTime, fill=holiday), stat="identity", position=position_dodge(), show.legend = FALSE, width=0.5) +
    scale_fill_manual(values = palette2) + 
    gridTheme,
  ncol=2)

3.5 Call type characteristics

We’ve also looked at the call type characteristics in the main dataset. CallPriority feature consist of three types:

  • Priority 1: Critical
  • Priority 2: Urgent
  • Priority 3: Non-Urgent

Surprisingly, the Priority 1 calls did not have the shortest response time. Non-urgent Priority 3 calls had the shortest response time.

# map meanResponseTime by CallPriority
ggplot(data = main_ems.sf %>%
         group_by(CallPriority) %>%
         dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
  geom_bar(aes(x=CallPriority, y=meanResponeTime, fill=CallPriority), stat="identity", position=position_dodge(), show.legend = FALSE, width=0.5) +
  labs(title= "Avg Response Time Difference of Call Prioriy Types") +
  plotTheme

The following sets of graphs show eight different rescue squad types. EMS Calls requiring an EMS Chief has significantly shorter response time, which is due to the fact that EMS chiefs are sent on special and critical cases. Calls with Special Events, air units, water unit, mass casualty and fire team rescue squads and significantly shorter response time. These features will be very helpful in our prediction model.

grid.arrange(
  ggplot(data = main_ems.sf %>%
         group_by(advanced_life_support) %>%
         drop_na(advanced_life_support) %>%
         dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
  geom_bar(aes(x=advanced_life_support, y=meanResponeTime, fill=advanced_life_support), stat="identity", position=position_dodge(), show.legend = FALSE, width=0.5) +
  scale_fill_manual(values = palette2) + 
  labs(title= "Avg Response Time Difference of EMS Calls requiring Adv. Life Support") +
  gridTheme,
  
  ggplot(data = main_ems.sf %>%
         group_by(chief_dispatch) %>%
         drop_na(chief_dispatch) %>%
         dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
  geom_bar(aes(x=chief_dispatch, y=meanResponeTime, fill=chief_dispatch), stat="identity", position=position_dodge(), show.legend = FALSE, width=0.5) +
  scale_fill_manual(values = palette2) + 
  labs(title= "Avg Response Time Difference of EMS Calls needing a Chief") +
  gridTheme,
  
  ggplot(data = main_ems.sf %>%
         group_by(special_event) %>%
         drop_na(special_event) %>%
         dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
  geom_bar(aes(x=special_event, y=meanResponeTime, fill=special_event), stat="identity", position=position_dodge(), show.legend = FALSE, width=0.5) +
  scale_fill_manual(values = palette2) + 
  labs(title= "Avg Response Time Difference of EMS Calls at Special Events") +
  gridTheme,
  
  ggplot(data = main_ems.sf %>%
         group_by(air_dispatch) %>%
         drop_na(air_dispatch) %>%
         dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
  geom_bar(aes(x=air_dispatch, y=meanResponeTime, fill=air_dispatch), stat="identity", position=position_dodge(), show.legend = FALSE, width=0.5) +
  scale_fill_manual(values = palette2) + 
  labs(title= "Avg Response Time Difference of EMS Calls Using Helicopter Amb.") +
  gridTheme
)

grid.arrange(
  ggplot(data = main_ems.sf %>%
         group_by(hold_dispatch) %>%
         drop_na(hold_dispatch) %>%
         dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
  geom_bar(aes(x=hold_dispatch, y=meanResponeTime, fill=hold_dispatch), stat="identity", position=position_dodge(), show.legend = FALSE, width=0.5) +
  scale_fill_manual(values = palette2) +
  labs(title= "Avg Response Time Difference of EMS Calls on HOLD by Dispatcher") +
  gridTheme,
  
  ggplot(data = main_ems.sf %>%
         group_by(water_dispatch) %>%
         drop_na(water_dispatch) %>%
         dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
  geom_bar(aes(x=water_dispatch, y=meanResponeTime, fill=water_dispatch), stat="identity", position=position_dodge(), show.legend = FALSE, width=0.5) +
  scale_fill_manual(values = palette2) + 
  labs(title= "Avg Response Time Difference of Marine EMS Calls") +
  gridTheme,
  
  ggplot(data = main_ems.sf %>%
         group_by(mass_casualty_dispatch) %>%
         drop_na(mass_casualty_dispatch) %>%
         dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
  geom_bar(aes(x=mass_casualty_dispatch, y=meanResponeTime, fill=mass_casualty_dispatch), stat="identity", position=position_dodge(), show.legend = FALSE, width=0.5) +
  scale_fill_manual(values = palette2) + 
  labs(title= "Avg Response Time Difference of Mass Casualty EMS Calls") +
  gridTheme,
  
  ggplot(data = main_ems.sf %>%
         group_by(fire_dispatch) %>%
         drop_na(fire_dispatch) %>%
         dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
  geom_bar(aes(x=fire_dispatch, y=meanResponeTime, fill=fire_dispatch), stat="identity", position=position_dodge(), show.legend = FALSE, width=0.5) +
  scale_fill_manual(values = palette2) + 
  labs(title= "Avg Response Time Difference of Fire-related EMS Calls") +
  gridTheme
)

3.6 Weather conditions

grid.arrange(
  ggplot(main_ems.sf, aes(x=Precipitation, y=ResponseTime)) +
    geom_point(size = .75, colour = "darkblue") +
    gridTheme,
  ggplot(main_ems.sf, aes(x=Temperature, y=ResponseTime)) +
    geom_point(size = .75, colour = "darkblue") +
    gridTheme,
  ggplot(main_ems.sf, aes(x=Wind_Speed, y=ResponseTime)) +
    geom_point(size = .75, colour = "darkblue") +
    gridTheme
)

As you can see from the scatter plots above, the hourly data on precipitation, temperature, and wind speed did not show a significant relationship on the response time. The following set of graphs show the three engineered binary features: SnowPresent, HeavyRain, and StrongWind. Calls during snow and strong wind conditions have longer response time. Surprisingly, calls during heavy rain had slightly shorter response time compared to calls during non heavy rain condition.

grid.arrange(
  ggplot(data = main_ems.sf %>%
           group_by(SnowPresent) %>%
           drop_na(SnowPresent) %>%
           dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
    geom_bar(aes(x=SnowPresent, y=meanResponeTime, fill=SnowPresent), stat="identity", position=position_dodge(), show.legend = FALSE, width=0.5) +
    scale_fill_manual(values = palette2) + 
    gridTheme,
  ggplot(data = main_ems.sf %>%
           group_by(HeavyRain) %>%
           drop_na(HeavyRain) %>%
           dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
    geom_bar(aes(x=HeavyRain, y=meanResponeTime, fill=HeavyRain), stat="identity", position=position_dodge(), show.legend = FALSE, width=0.5) +
    scale_fill_manual(values = palette2) + 
    gridTheme,
  ggplot(data = main_ems.sf %>%
           group_by(StrongWind) %>%
           drop_na(StrongWind) %>%
           dplyr::summarise(meanResponeTime = mean(ResponseTime, na.rm=TRUE))) +
    geom_bar(aes(x=StrongWind, y=meanResponeTime, fill=StrongWind), stat="identity", position=position_dodge(), show.legend = FALSE, width=0.5) +
    scale_fill_manual(values = palette2) + 
    gridTheme, 
  nrow=1, ncol=3
)

3.7 Correlations

As part of the exploratory analysis, we examined correlation to assist in identifying features that may be useful for predicting our dependent variable ResponseTime. A correlation matrix helps us visualize correlation across numeric variables. In the chart, the darker colors imply stronger correlation. These plot helps to determine features that are correlated to ResponseTime and variables that are correlated with each other. We find that many of the health outcome variables and census demographic variables are highly correlated with each other.

# CORRELATIONS
selected_vars <- 
  select(st_drop_geometry(main_ems.sf),
         ResponseTime,
         CallPriority,
         CallVolume,
         ems_station_nn,
         hospitals_nn,
         fire_stations_nn,
         b_advanced_life_support,
         b_chief_dispatch,
         b_special_event,
         b_air_dispatch,
         b_hold_dispatch,
         b_water_dispatch, 
         b_mass_casualty_dispatch,
         b_fire_dispatch,
         Temperature,
         Precipitation,
         Wind_Speed,
         Humidity,
         ACCESS2_CrudePrev,
         ARTHRITIS_CrudePrev,
         BINGE_CrudePrev,
         BPHIGH_CrudePrev,
         BPMED_CrudePrev,
         CANCER_CrudePrev,
         CASTHMA_CrudePrev,
         CHD_CrudePrev,
         CHECKUP_CrudePrev,
         CHOLSCREEN_CrudePrev,
         COLON_SCREEN_CrudePrev,
         COPD_CrudePrev,
         COREM_CrudePrev,
         COREW_CrudePrev,
         CSMOKING_CrudePrev,
         DENTAL_CrudePrev,
         DIABETES_CrudePrev,
         HIGHCHOL_CrudePrev,
         KIDNEY_CrudePrev,
         LPA_CrudePrev,
         MAMMOUSE_CrudePrev,
         MHLTH_CrudePrev,
         OBESITY_CrudePrev,
         PAPTEST_CrudePrev,
         PHLTH_CrudePrev,
         SLEEP_CrudePrev,
         STROKE_CrudePrev,
         TEETHLOST_CrudePrev,
         TotalPop,
         MedHHInc,
         TotalHH,
         pctWhite,
         pctBlack,
         pctHis,
         pctBachelors,
         pctPoverty,
         pctCarCommute,
         pctPubCommute,
         PopDens) %>% 
  na.omit()

ggcorrplot(
  round(cor(selected_vars), 1), 
  p.mat = cor_pmat(selected_vars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +  
  labs(title = "Correlation over selected features",
       caption="Correlation Plot") + gridTheme

4. Model Building

After exploring and testing correlations of all the variables, we have started our model building process by including all the features we have collected. The table below summarizes the results of 5 models we created in trying to find the best prediction model. The table provides the description of the models, Adjusted R^2, mean absolute error (MAE) and mean absolute percent error (MAPE) values for both training and testing datasets we have create in testing the models.

# TESTING MODEL 

#set up variables for model building
main_ems.sf$log_ResponseTime <- log(main_ems.sf$ResponseTime)
main_ems.sf$week <- factor(main_ems.sf$week)
main_ems.sf$dotw <- factor(main_ems.sf$dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
main_ems.sf$dotw <- relevel(main_ems.sf$dotw, ref = "Mon")
main_ems.sf$holiday <- factor(main_ems.sf$holiday, levels = c("Holiday", "Non-Holiday"))
main_ems.sf$holiday <- relevel(main_ems.sf$holiday, ref = "Holiday")

#set up training and testing dataframes
# set random seed
set.seed(31337)

# get index for training sample
inTrain <- caret::createDataPartition(
  y = paste(main_ems.sf$Zipcode),
  p = .60, list = FALSE)
# split data into training and test
vb_training <- main_ems.sf[inTrain,] 
vb_test     <- main_ems.sf[-inTrain,]  

#First, we will create a "kitchen-sink" model which has all variables that we've analyzed in this projects.
reg1 <- lm(ResponseTime ~ ., data = st_drop_geometry(main_ems.sf) %>% 
             dplyr::select( ResponseTime,
                            CallPriority,
                            CallVolume,
                            ems_station_nn,
                            hospitals_nn,
                            fire_stations_nn,
                            b_advanced_life_support,
                            b_chief_dispatch,
                            b_special_event,
                            b_air_dispatch,
                            b_hold_dispatch,
                            b_water_dispatch, 
                            b_mass_casualty_dispatch,
                            b_fire_dispatch,
                            Temperature,
                            Precipitation,
                            Wind_Speed,
                            Humidity,
                            ACCESS2_CrudePrev,
                            ARTHRITIS_CrudePrev,
                            BINGE_CrudePrev,
                            BPHIGH_CrudePrev,
                            BPMED_CrudePrev,
                            CANCER_CrudePrev,
                            CASTHMA_CrudePrev,
                            CHD_CrudePrev,
                            CHECKUP_CrudePrev,
                            CHOLSCREEN_CrudePrev,
                            COLON_SCREEN_CrudePrev,
                            COPD_CrudePrev,
                            COREM_CrudePrev,
                            COREW_CrudePrev,
                            CSMOKING_CrudePrev,
                            DENTAL_CrudePrev,
                            DIABETES_CrudePrev,
                            HIGHCHOL_CrudePrev,
                            KIDNEY_CrudePrev,
                            LPA_CrudePrev,
                            MAMMOUSE_CrudePrev,
                            MHLTH_CrudePrev,
                            OBESITY_CrudePrev,
                            PAPTEST_CrudePrev,
                            PHLTH_CrudePrev,
                            SLEEP_CrudePrev,
                            STROKE_CrudePrev,
                            TEETHLOST_CrudePrev,
                            TotalPop,
                            MedHHInc,
                            TotalHH,
                            pctWhite,
                            pctBlack,
                            pctHis,
                            pctBachelors,
                            pctPoverty,
                            pctCarCommute,
                            pctPubCommute,
                            PopDens,
                            week, 
                            dotw, 
                            time_of_day, 
                            times, 
                            holiday, 
                            season, 
                            weekend, 
                            SnowPresent, 
                            HeavyRain, 
                            StrongWind))
summary(reg1)

#get the MAE and MAPE for training and testing dataframes using reg1
vb_training.reg1 <-
  vb_training %>%
  mutate(ResponseTime.Predict = predict(reg1, vb_training),
         ResponseTime.Error = ResponseTime.Predict - ResponseTime,
         ResponseTime.AbsError = abs(ResponseTime.Predict - ResponseTime),
         ResponseTime.APE = (abs(ResponseTime.Predict - ResponseTime)) / ResponseTime.Predict)

vb_training.reg1_MAE <-mean(vb_training.reg1$ResponseTime.AbsError, na.rm = T)
vb_training.reg1_MAE
vb_training.reg1_MAPE <- mean(vb_training.reg1$ResponseTime.APE, na.rm = T)
vb_training.reg1_MAPE

vb_test.reg1 <-
  vb_test %>%
  mutate(ResponseTime.Predict = predict(reg1, vb_test),
         ResponseTime.Error = ResponseTime.Predict - ResponseTime,
         ResponseTime.AbsError = abs(ResponseTime.Predict - ResponseTime),
         ResponseTime.APE = (abs(ResponseTime.Predict - ResponseTime)) / ResponseTime.Predict)

vb_test.reg1_MAE <-mean(vb_test.reg1$ResponseTime.AbsError, na.rm = T)
vb_test.reg1_MAE
vb_test.reg1_MAPE <- mean(vb_test.reg1$ResponseTime.APE, na.rm = T)
vb_test.reg1_MAPE

#Initial Results produce an MAE of 2.6 for the training dataset and 2.62 for the testing dataset. the mean average percentage error 
#for these predictions was 28%

#next, we will run a stepwise function using backwards selection. This function will successively remove variables which do not significantly improve model outcomes.

#stepwise
#step(lm(ResponseTime ~ ., data = st_drop_geometry(main_ems.sf) %>% 
    #       dplyr::select(ResponseTime,
    #                     CallPriority,
    #                     CallVolume,
    #                     ems_station_nn,
    #                     hospitals_nn,
    #                     fire_stations_nn,
    #                     b_advanced_life_support,
    #                     b_chief_dispatch,
    #                     b_special_event,
    #                     b_air_dispatch,
    #                     b_hold_dispatch,
    #                     b_water_dispatch, 
    #                     b_mass_casualty_dispatch,
    #                     b_fire_dispatch,
    #                    Temperature,
    #                     Precipitation,
    #                     Wind_Speed,
    #                     Humidity,
    #                     ACCESS2_CrudePrev,
    #                     ARTHRITIS_CrudePrev,
    #                     BINGE_CrudePrev,
    #                     BPHIGH_CrudePrev,
    #                     BPMED_CrudePrev,
    #                     CANCER_CrudePrev,
    #                     CASTHMA_CrudePrev,
    #                     CHD_CrudePrev,
    #                     CHECKUP_CrudePrev,
    #                     CHOLSCREEN_CrudePrev,
    #                     COLON_SCREEN_CrudePrev,
    #                     COPD_CrudePrev,
    #                     COREM_CrudePrev,
    #                     COREW_CrudePrev,
    #                     CSMOKING_CrudePrev,
    #                     DENTAL_CrudePrev,
    #                     DIABETES_CrudePrev,
    #                     HIGHCHOL_CrudePrev,
    #                     KIDNEY_CrudePrev,
    #                     LPA_CrudePrev,
    #                     MAMMOUSE_CrudePrev,
    #                     MHLTH_CrudePrev,
    #                     OBESITY_CrudePrev,
    #                     PAPTEST_CrudePrev,
    #                     PHLTH_CrudePrev,
    #                     SLEEP_CrudePrev,
    #                     STROKE_CrudePrev,
    #                     TEETHLOST_CrudePrev,
    #                     TotalPop,
    #                     MedHHInc,
    #                     TotalHH,
    #                     pctWhite,
    #                    pctBlack,
    #                     pctHis,
    #                     pctBachelors,
    #                     pctPoverty,
    #                     pctCarCommute,
    #                     pctPubCommute,
    #                     PopDens,
    #                     week, 
    #                     dotw, 
    #                     time_of_day, 
    #                     times, 
    #                     holiday, 
    #                     season, 
    #                     weekend, 
    #                     SnowPresent, 
    #                     HeavyRain, 
    #                     StrongWind)), 
    # direction="backward")

#stepwise backward selection has removed the following variables:
# - KIDNEY_CrudePrev          1         2 727776 117748
# - pctPubCommute             1         2 727777 117748
# - Humidity                  1         3 727778 117748
# - CASTHMA_CrudePrev         1         4 727778 117748
# - SLEEP_CrudePrev           1         6 727781 117748
# - hospitals_nn              1        12 727786 117748
# - TEETHLOST_CrudePrev       1        12 727787 117748
# - pctPoverty                1        12 727787 117748
# - pctHis                    1        14 727788 117748
# - b_hold_dispatch           1        26 727801 117749
# - Wind_Speed                1        28 727802 117749
# - MAMMOUSE_CrudePrev        1        31 727805 117749
#and more, but the output of what was removed was hidden XD. only the variables remaining were still visible.

reg1.1<- lm(ResponseTime ~ ., data = st_drop_geometry(main_ems.sf) %>% 
     dplyr::select(ResponseTime,
                   CallPriority,
                   CallVolume,
                   ems_station_nn,
                   fire_stations_nn,
                   b_advanced_life_support,
                   b_chief_dispatch,
                   b_special_event,
                   b_air_dispatch,
                   b_water_dispatch,
                   b_mass_casualty_dispatch,
                   b_fire_dispatch,
                   Temperature,
                   Precipitation,
                   ACCESS2_CrudePrev,
                   ARTHRITIS_CrudePrev,
                   BINGE_CrudePrev,
                   BPHIGH_CrudePrev,
                   BPMED_CrudePrev,
                   CANCER_CrudePrev,
                   CHD_CrudePrev,
                   CHECKUP_CrudePrev,
                   CHOLSCREEN_CrudePrev,
                   COLON_SCREEN_CrudePrev,
                   COPD_CrudePrev,
                   COREM_CrudePrev,
                   COREW_CrudePrev,
                   CSMOKING_CrudePrev,
                   DENTAL_CrudePrev,
                   DIABETES_CrudePrev,
                   HIGHCHOL_CrudePrev,
                   LPA_CrudePrev,
                   OBESITY_CrudePrev,
                   PAPTEST_CrudePrev,
                   PHLTH_CrudePrev,
                   STROKE_CrudePrev,
                   TotalPop,
                   MedHHInc,
                   TotalHH,
                   pctWhite,
                   pctBachelors,
                   pctPoverty,
                   pctCarCommute,
                   PopDens,
                   week,
                   dotw,
                   times,
                   holiday,
                   SnowPresent,
                   HeavyRain,
                   StrongWind))


summary(reg1.1)

#get MAE and MAPE of reg1.1
vb_training.reg1.1 <-
  vb_training %>%
  mutate(ResponseTime.Predict = predict(reg1.1, vb_training),
         ResponseTime.Error = ResponseTime.Predict - ResponseTime,
         ResponseTime.AbsError = abs(ResponseTime.Predict - ResponseTime),
         ResponseTime.APE = (abs(ResponseTime.Predict - ResponseTime)) / ResponseTime.Predict)

vb_training.reg1.1_MAE <-mean(vb_training.reg1.1$ResponseTime.AbsError, na.rm = T)
vb_training.reg1.1_MAE
vb_training.reg1.1_MAPE <- mean(vb_training.reg1.1$ResponseTime.APE, na.rm = T)
vb_training.reg1.1_MAPE

vb_test.reg1.1 <-
  vb_test %>%
  mutate(ResponseTime.Predict = predict(reg1.1, vb_test),
         ResponseTime.Error = ResponseTime.Predict - ResponseTime,
         ResponseTime.AbsError = abs(ResponseTime.Predict - ResponseTime),
         ResponseTime.APE = (abs(ResponseTime.Predict - ResponseTime)) / ResponseTime.Predict)

vb_test.reg1.1_MAE <-mean(vb_test.reg1.1$ResponseTime.AbsError, na.rm = T)
vb_test.reg1.1_MAE
vb_test.reg1.1_MAPE <- mean(vb_test.reg1.1$ResponseTime.APE, na.rm = T)
vb_test.reg1.1_MAPE

#results from model 1.1 show a decreased MAE, from 2.621 to 2.620

#for regression 2, we will select only a some of the most signiifcant variables from regression 1.1. This model represents a more "lean" model in comparison to the "kitchen sink" model.

reg2<- lm(ResponseTime ~ ., data = st_drop_geometry(main_ems.sf) %>% 
              dplyr::select(ResponseTime,
                            CallPriority,
                            CallVolume,
                            ems_station_nn,
                            fire_stations_nn,
                            b_advanced_life_support,
                            b_chief_dispatch,
                            b_special_event,
                            b_air_dispatch,
                            b_water_dispatch,
                            b_mass_casualty_dispatch,
                            b_fire_dispatch,
                            Temperature,
                            BINGE_CrudePrev,
                            BPHIGH_CrudePrev,
                            CHD_CrudePrev,
                            COLON_SCREEN_CrudePrev,
                            CSMOKING_CrudePrev,
                            DENTAL_CrudePrev,
                            PAPTEST_CrudePrev,
                            PHLTH_CrudePrev,
                            STROKE_CrudePrev,
                            TotalPop,
                            PopDens,
                            week,
                            dotw,
                            times,
                            holiday,
                            SnowPresent,
                            StrongWind))


summary(reg2)

#get MAE and MAPE of reg2
vb_training.reg2 <-
  vb_training %>%
  mutate(ResponseTime.Predict = predict(reg2, vb_training),
         ResponseTime.Error = ResponseTime.Predict - ResponseTime,
         ResponseTime.AbsError = abs(ResponseTime.Predict - ResponseTime),
         ResponseTime.APE = (abs(ResponseTime.Predict - ResponseTime)) / ResponseTime.Predict)

vb_training.reg2_MAE <-mean(vb_training.reg2$ResponseTime.AbsError, na.rm = T)
vb_training.reg2_MAE
vb_training.reg2_MAPE <- mean(vb_training.reg2$ResponseTime.APE, na.rm = T)
vb_training.reg2_MAPE

vb_test.reg2 <-
  vb_test %>%
  mutate(ResponseTime.Predict = predict(reg2, vb_test),
         ResponseTime.Error = ResponseTime.Predict - ResponseTime,
         ResponseTime.AbsError = abs(ResponseTime.Predict - ResponseTime),
         ResponseTime.APE = (abs(ResponseTime.Predict - ResponseTime)) / ResponseTime.Predict)

vb_test.reg2_MAE <-mean(vb_test.reg2$ResponseTime.AbsError, na.rm = T)
vb_test.reg2_MAE
vb_test.reg2_MAPE <- mean(vb_test.reg2$ResponseTime.APE, na.rm = T)
vb_test.reg2_MAPE
#next, we will run a stepwise function using backwards selection. This function will successively remove variables which do not significantly improve model outcomes.

#stepwise
#step(lm(ResponseTime ~ ., data = st_drop_geometry(main_ems.sf) %>% 
  #        dplyr::select(ResponseTime,
  #                      CallPriority,
  #                      CallVolume,
  #                      ems_station_nn,
  #                      fire_stations_nn,
  #                      b_advanced_life_support,
  #                      b_chief_dispatch,
  #                      b_special_event,
  #                      b_air_dispatch,
  #                      b_water_dispatch,
  #                      b_mass_casualty_dispatch,
  #                      b_fire_dispatch,
  #                      Temperature,
  #                      BINGE_CrudePrev,
  #                      BPHIGH_CrudePrev,
  #                      CHD_CrudePrev,
  #                      COLON_SCREEN_CrudePrev,
  #                      CSMOKING_CrudePrev,
  #                      DENTAL_CrudePrev,
  #                       PAPTEST_CrudePrev,
  #                      PHLTH_CrudePrev,
  #                      STROKE_CrudePrev,
  #                      TotalPop,
  #                      PopDens,
  #                      week,
  #                      dotw,
  #                      times,
  #                      holiday,
  #                      SnowPresent,
  #                     StrongWind)), 
  #   direction="backward")

reg2.1<- lm(ResponseTime ~ ., data = st_drop_geometry(main_ems.sf) %>% 
            dplyr::select(ResponseTime,
                          CallPriority,
                          CallVolume,
                          ems_station_nn,
                          fire_stations_nn,
                          b_advanced_life_support,
                          b_chief_dispatch,
                          b_special_event,
                          b_air_dispatch,
                          b_water_dispatch,
                          b_mass_casualty_dispatch,
                          b_fire_dispatch,
                          Temperature,
                          BINGE_CrudePrev,
                          BPHIGH_CrudePrev,
                          CHD_CrudePrev,
                          CSMOKING_CrudePrev,
                          DENTAL_CrudePrev,
                          PAPTEST_CrudePrev,
                          PHLTH_CrudePrev,
                          STROKE_CrudePrev,
                          TotalPop,
                          PopDens,
                          week,
                          dotw,
                          times,
                          holiday,
                          SnowPresent,
                          StrongWind))


summary(reg2.1)

#get MAE and MAPE of reg2.1
vb_training.reg2.1 <-
  vb_training %>%
  mutate(ResponseTime.Predict = predict(reg2.1, vb_training),
         ResponseTime.Error = ResponseTime.Predict - ResponseTime,
         ResponseTime.AbsError = abs(ResponseTime.Predict - ResponseTime),
         ResponseTime.APE = (abs(ResponseTime.Predict - ResponseTime)) / ResponseTime.Predict)

vb_training.reg2.1_MAE <-mean(vb_training.reg2.1$ResponseTime.AbsError, na.rm = T)
vb_training.reg2.1_MAE
vb_training.reg2.1_MAPE <- mean(vb_training.reg2.1$ResponseTime.APE, na.rm = T)
vb_training.reg2.1_MAPE

vb_test.reg2.1 <-
  vb_test %>%
  mutate(ResponseTime.Predict = predict(reg2.1, vb_test),
         ResponseTime.Error = ResponseTime.Predict - ResponseTime,
         ResponseTime.AbsError = abs(ResponseTime.Predict - ResponseTime),
         ResponseTime.APE = (abs(ResponseTime.Predict - ResponseTime)) / ResponseTime.Predict)

vb_test.reg2.1_MAE <-mean(vb_test.reg2.1$ResponseTime.AbsError, na.rm = T)
vb_test.reg2.1_MAE
vb_test.reg2.1_MAPE <- mean(vb_test.reg2.1$ResponseTime.APE, na.rm = T)
vb_test.reg2.1_MAPE

#results from model are worse than results from the previous model. MAE increased form 2.62 to 2.63. Next we will log transform response time,
#and attempt to predict again with our best performing model yet, reg 1.1

reg1.2<- lm(log_ResponseTime ~ ., data = st_drop_geometry(main_ems.sf) %>% 
              dplyr::select(log_ResponseTime,
                            CallPriority,
                            CallVolume,
                            ems_station_nn,
                            fire_stations_nn,
                            b_advanced_life_support,
                            b_chief_dispatch,
                            b_special_event,
                            b_air_dispatch,
                            b_water_dispatch,
                            b_mass_casualty_dispatch,
                            b_fire_dispatch,
                            Temperature,
                            Precipitation,
                            ACCESS2_CrudePrev,
                            ARTHRITIS_CrudePrev,
                            BINGE_CrudePrev,
                            BPHIGH_CrudePrev,
                            BPMED_CrudePrev,
                            CANCER_CrudePrev,
                            CHD_CrudePrev,
                            CHECKUP_CrudePrev,
                            CHOLSCREEN_CrudePrev,
                            COLON_SCREEN_CrudePrev,
                            COPD_CrudePrev,
                            COREM_CrudePrev,
                            COREW_CrudePrev,
                            CSMOKING_CrudePrev,
                            DENTAL_CrudePrev,
                            DIABETES_CrudePrev,
                            HIGHCHOL_CrudePrev,
                            LPA_CrudePrev,
                            OBESITY_CrudePrev,
                            PAPTEST_CrudePrev,
                            PHLTH_CrudePrev,
                            STROKE_CrudePrev,
                            TotalPop,
                            MedHHInc,
                            TotalHH,
                            pctWhite,
                            pctBachelors,
                            pctPoverty,
                            pctCarCommute,
                            PopDens,
                            week,
                            dotw,
                            times,
                            holiday,
                            SnowPresent,
                            HeavyRain,
                            StrongWind) %>%
              dplyr::filter(is.infinite(log_ResponseTime) == FALSE))
summary(reg1.2)

vb_training.reg1.2 <-
  vb_training %>%
  mutate(log_ResponseTime.Predict = predict(reg1.2, vb_training),
         ResponseTime.Predict = exp(log_ResponseTime.Predict),
         ResponseTime.Error = ResponseTime.Predict - ResponseTime,
         ResponseTime.AbsError = abs(ResponseTime.Predict - ResponseTime),
         ResponseTime.APE = (abs(ResponseTime.Predict - ResponseTime)) / ResponseTime.Predict)

vb_training.reg1.2_MAE <-mean(vb_training.reg1.2$ResponseTime.AbsError, na.rm = T)
vb_training.reg1.2_MAE
vb_training.reg1.2_MAPE <- mean(vb_training.reg1.2$ResponseTime.APE, na.rm = T)
vb_training.reg1.2_MAPE

vb_test.reg1.2 <-
  vb_test %>%
  mutate(log_ResponseTime.Predict = predict(reg1.2, vb_test),
         ResponseTime.Predict = exp(log_ResponseTime.Predict),
         ResponseTime.Error = ResponseTime.Predict - ResponseTime,
         ResponseTime.AbsError = abs(ResponseTime.Predict - ResponseTime),
         ResponseTime.APE = (abs(ResponseTime.Predict - ResponseTime)) / ResponseTime.Predict)

vb_test.reg1.2_MAE <-mean(vb_test.reg1.2$ResponseTime.AbsError, na.rm = T)
vb_test.reg1.2_MAE
vb_test.reg1.2_MAPE <- mean(vb_test.reg1.2$ResponseTime.APE, na.rm = T)
vb_test.reg1.2_MAPE

#results from log transformed variables have been the best yet - with mean absolute error from out test set at 2.58.

4.1 Summary of models

regression_table <-data.frame(
  Regression = c("reg1", "reg1.1", "reg2", "reg2.1", "reg1.2"),
  Description = c("Kitchen sink model with all features", "trimmed down model of reg1 using stepwise backward selection function", "model with most signiifcant variables from regression 1.1", "trimmed down model of reg2 using stepwise backward selection function", "FINAL MODEL: model with log transformed Response Time feature (dependent variable)"), 
  "Adjusted R2" = c("0.2211","0.2213","0.2159","0.2159","0.2651"),
  "MAE on Training set" = c("2.6389","2.6379","2.6540","2.6542","2.5987"),
  "MAPE on Training set" = c("28.712%","28.998%","29.772%","29.768%","30.865%"),
  "MAE on Testing set" = c("2.5646","2.5649","2.5806","2.5809","2.5205"),
  "MAPE on Testing set" = c("28.417%","28.448%","28.459%","28.459%","29.964%")
)

kbl(regression_table, caption ="<b>Summary of regression models<b>") %>%
  kable_paper(c("hover", "condensed"), full_width = T, html_font = "Avenir") %>%
  kable_styling (bootstrap_options = "striped", "condensed", font_size = 10) %>%
  row_spec(0, bold = T, color = "white", background = "#2e4057", font_size = 10) %>%
  row_spec(5, bold = T, background = "#bbdefb") %>%
  column_spec(1, bold = T)
Summary of regression models
Regression Description Adjusted.R2 MAE.on.Training.set MAPE.on.Training.set MAE.on.Testing.set MAPE.on.Testing.set
reg1 Kitchen sink model with all features 0.2211 2.6389 28.712% 2.5646 28.417%
reg1.1 trimmed down model of reg1 using stepwise backward selection function 0.2213 2.6379 28.998% 2.5649 28.448%
reg2 model with most signiifcant variables from regression 1.1 0.2159 2.6540 29.772% 2.5806 28.459%
reg2.1 trimmed down model of reg2 using stepwise backward selection function 0.2159 2.6542 29.768% 2.5809 28.459%
reg1.2 FINAL MODEL: model with log transformed Response Time feature (dependent variable) 0.2651 2.5987 30.865% 2.5205 29.964%

First model reg1 is a “kitchen-sink” model which has all variables that we’ve analyzed in this projects. The second model reg1.1 is a trimmed down model of reg1 using the stepwise backward selection method. This process drops each individual variable at a time to find a model with the least AIC (Akaike information criterion) which estimates the quality of each model relative to other models. the output of this process, reg1.1 was output of the stepwise function with the smallest AIC value, indicating this output is the statistically best model from the variables from reg1. However, the improvement of th model seems minimal when considering the *Adjusted R^2, mean absolute error (MAE) and mean absolute percent error (MAPE)** values.

Next, we selected the variables that had the most statistical significance from reg1 to create reg2. This model represents a more “lean” model in comparison to the “kitchen sink” model. Unfortunately, reg2 did now show improvement from the previous two models - it actually had slightly higher MAE and MAPE values on both testing and training sets. Then, we conducted the stepwise backwards selection method to reg2 to get a leaner model by removing variables which do not significantly improve model outcomes. The result is reg2.1. However, the result was disappointing as the reg2.1 was not statistically better than any of the previous models.

Using our best performing model yet, reg 1.1, we will create another model reg1.2 with log transform response time as the dependent variable. The results from reg1.2 was the statistically best among the 5 regression models - with the highest adjusted R^2 and mean absolute error from out test set at 2.59, which is the smallest among the five models. Comparing results from the five models, we concluded that reg1.2 is our final model.

4.2 Final Model

Summary table of our final model reg1.2 is below. Most of the variables showed statistical significance, with the exception of Precipitation and some categorical variables in days of the week, week, and 3 hour time bins.

Summary of final model (reg1.2)
Estimate CI (lower) CI (upper) Std. Error t value Pr(>|t|)
(Intercept) -3.1132315 -5.4543564 -0.7721067 1.1944372 -2.6064422 0.009 **
CallPriority 0.3982936 0.3900130 0.4065742 0.0042248 94.2762460 <0.001 ***
CallVolume 0.0045684 0.0036226 0.0055142 0.0004826 9.4672706 <0.001 ***
ems_station_nn 0.0000105 0.0000084 0.0000126 0.0000011 9.9128039 <0.001 ***
fire_stations_nn 0.0000206 0.0000183 0.0000229 0.0000012 17.8656434 <0.001 ***
b_advanced_life_support -0.0406975 -0.0607071 -0.0206878 0.0102089 -3.9864815 <0.001 ***
b_chief_dispatch -1.2356933 -1.4640497 -1.0073369 0.1165070 -10.6061741 <0.001 ***
b_special_event -0.8954262 -1.1459180 -0.6449344 0.1278004 -7.0064420 <0.001 ***
b_air_dispatch -2.5800117 -3.3700947 -1.7899287 0.4030988 -6.4004458 <0.001 ***
b_water_dispatch -2.1577705 -2.3204651 -1.9950758 0.0830065 -25.9952024 <0.001 ***
b_mass_casualty_dispatch -2.2896657 -2.8489422 -1.7303892 0.2853417 -8.0242931 <0.001 ***
b_fire_dispatch -0.4777291 -0.5178095 -0.4376487 0.0204489 -23.3620372 <0.001 ***
Temperature -0.0011936 -0.0016416 -0.0007457 0.0002285 -5.2228237 <0.001 ***
Precipitation 0.0502689 -0.0220267 0.1225645 0.0368851 1.3628530 0.173
ACCESS2_CrudePrev 0.0423271 0.0073753 0.0772789 0.0178323 2.3736162 0.018
ARTHRITIS_CrudePrev 0.1542271 0.0823430 0.2261111 0.0366751 4.2052247 <0.001 ***
BINGE_CrudePrev -0.0471777 -0.0673994 -0.0269561 0.0103170 -4.5727982 <0.001 ***
BPHIGH_CrudePrev -0.1803228 -0.2229910 -0.1376546 0.0217692 -8.2833846 <0.001 ***
BPMED_CrudePrev 0.0253844 0.0175658 0.0332030 0.0039890 6.3635433 <0.001 ***
CANCER_CrudePrev -0.1624554 -0.2830690 -0.0418417 0.0615368 -2.6399689 0.008 **
CHD_CrudePrev 0.2824107 0.1775136 0.3873078 0.0535183 5.2769010 <0.001 ***
CHECKUP_CrudePrev 0.0885607 0.0612486 0.1158728 0.0139346 6.3554594 <0.001 ***
CHOLSCREEN_CrudePrev 0.0254223 0.0104385 0.0404061 0.0076447 3.3254766 0.001 ***
COLON_SCREEN_CrudePrev -0.0327033 -0.0431146 -0.0222920 0.0053118 -6.1566969 <0.001 ***
COPD_CrudePrev -0.2390515 -0.3597061 -0.1183970 0.0615577 -3.8833728 <0.001 ***
COREM_CrudePrev 0.0137210 0.0059881 0.0214539 0.0039453 3.4778142 0.001 ***
COREW_CrudePrev 0.0198775 0.0088735 0.0308815 0.0056142 3.5405517 <0.001 ***
CSMOKING_CrudePrev 0.2217597 0.1786902 0.2648292 0.0219740 10.0919292 <0.001 ***
DENTAL_CrudePrev 0.0556168 0.0396612 0.0715724 0.0081405 6.8321063 <0.001 ***
DIABETES_CrudePrev 0.1011248 0.0523429 0.1499067 0.0248884 4.0631262 <0.001 ***
HIGHCHOL_CrudePrev -0.0275118 -0.0622687 0.0072450 0.0177329 -1.5514583 0.121
LPA_CrudePrev -0.0543152 -0.0917850 -0.0168453 0.0191170 -2.8411922 0.004 **
OBESITY_CrudePrev 0.0371693 0.0143303 0.0600083 0.0116524 3.1898335 0.001 **
PAPTEST_CrudePrev -0.0737060 -0.0969171 -0.0504949 0.0118423 -6.2239765 <0.001 ***
PHLTH_CrudePrev -0.4554324 -0.5613511 -0.3495136 0.0540395 -8.4277652 <0.001 ***
STROKE_CrudePrev 0.7296856 0.5842256 0.8751457 0.0742134 9.8322596 <0.001 ***
TotalPop 0.0001418 0.0001114 0.0001722 0.0000155 9.1416998 <0.001 ***
MedHHInc 0.0000011 0.0000005 0.0000017 0.0000003 3.5650992 <0.001 ***
TotalHH -0.0001384 -0.0001701 -0.0001066 0.0000162 -8.5515243 <0.001 ***
pctWhite -0.0017434 -0.0029826 -0.0005041 0.0006323 -2.7572932 0.006 **
pctBachelors 0.0185422 0.0092931 0.0277913 0.0047189 3.9293804 <0.001 ***
pctPoverty -0.0017964 -0.0035855 -0.0000073 0.0009128 -1.9679870 0.049
pctCarCommute -0.0027406 -0.0044446 -0.0010366 0.0008694 -3.1523756 0.002 **
PopDens 0.0000125 0.0000090 0.0000160 0.0000018 7.0443412 <0.001 ***
week: 2 0.0131820 -0.0158713 0.0422352 0.0148229 0.8892978 0.374
week: 3 -0.0863993 -0.1151022 -0.0576963 0.0146442 -5.8999017 <0.001 ***
week: 4 -0.0663616 -0.0957897 -0.0369336 0.0150141 -4.4199421 <0.001 ***
week: 5 -0.1070271 -0.1366171 -0.0774372 0.0150967 -7.0894161 <0.001 ***
week: 6 -0.0784309 -0.1077537 -0.0491081 0.0149604 -5.2425597 <0.001 ***
week: 7 -0.0984432 -0.1281713 -0.0687151 0.0151672 -6.4905305 <0.001 ***
week: 8 -0.0914378 -0.1221571 -0.0607186 0.0156729 -5.8341350 <0.001 ***
week: 9 -0.0935860 -0.1276978 -0.0594742 0.0174038 -5.3773396 <0.001 ***
week: 10 -0.0890749 -0.1255103 -0.0526396 0.0185892 -4.7917485 <0.001 ***
week: 11 -0.0858476 -0.1213243 -0.0503710 0.0181001 -4.7429384 <0.001 ***
week: 12 -0.1024382 -0.1378718 -0.0670046 0.0180781 -5.6664105 <0.001 ***
week: 13 -0.0670686 -0.1035409 -0.0305963 0.0186081 -3.6042697 <0.001 ***
week: 14 -0.0719161 -0.1088761 -0.0349561 0.0188569 -3.8137808 <0.001 ***
week: 15 -0.0671849 -0.1039849 -0.0303848 0.0187753 -3.5783638 <0.001 ***
week: 16 -0.0659942 -0.1037334 -0.0282549 0.0192545 -3.4274705 0.001 ***
week: 17 -0.1034232 -0.1412319 -0.0656144 0.0192900 -5.3615026 <0.001 ***
week: 18 -0.0613402 -0.0993126 -0.0233679 0.0193734 -3.1662047 0.002 **
week: 19 -0.0687393 -0.1048103 -0.0326684 0.0184033 -3.7351564 <0.001 ***
week: 20 -0.0764258 -0.1147086 -0.0381429 0.0195318 -3.9128799 <0.001 ***
week: 21 -0.0589667 -0.0964110 -0.0215224 0.0191040 -3.0866146 0.002 **
week: 22 -0.0363186 -0.0747731 0.0021359 0.0196194 -1.8511580 0.064 .
week: 23 -0.0717852 -0.1101176 -0.0334529 0.0195571 -3.6705477 <0.001 ***
week: 24 -0.0268815 -0.0658984 0.0121353 0.0199063 -1.3504027 0.177
week: 25 -0.0741351 -0.1132575 -0.0350127 0.0199602 -3.7141544 <0.001 ***
week: 26 -0.0762266 -0.1156268 -0.0368264 0.0201019 -3.7920056 <0.001 ***
week: 27 -0.0764348 -0.1159873 -0.0368824 0.0201796 -3.7877305 <0.001 ***
week: 28 -0.0396248 -0.0802038 0.0009541 0.0207033 -1.9139372 0.056 .
week: 29 -0.0605742 -0.1010058 -0.0201425 0.0206282 -2.9364811 0.003 **
week: 30 -0.0388731 -0.0793917 0.0016455 0.0206725 -1.8804237 0.06 .
week: 31 -0.0730560 -0.1128094 -0.0333025 0.0202821 -3.6019866 <0.001 ***
week: 32 0.0050153 -0.1838335 0.1938642 0.0963503 0.0520532 0.958
week: 42 -0.0803544 -0.1218713 -0.0388374 0.0211819 -3.7935442 <0.001 ***
week: 43 -0.0869144 -0.1245497 -0.0492791 0.0192014 -4.5264489 <0.001 ***
week: 44 -0.0868406 -0.1256245 -0.0480568 0.0197874 -4.3886746 <0.001 ***
week: 45 -0.1068197 -0.1441616 -0.0694777 0.0190518 -5.6068086 <0.001 ***
week: 46 -0.0930224 -0.1296579 -0.0563870 0.0186913 -4.9767706 <0.001 ***
week: 47 -0.1185378 -0.1555538 -0.0815219 0.0188854 -6.2766766 <0.001 ***
week: 48 -0.0796076 -0.1158013 -0.0434139 0.0184659 -4.3110498 <0.001 ***
week: 49 -0.0837227 -0.1198153 -0.0476302 0.0184143 -4.5466050 <0.001 ***
week: 50 -0.1123765 -0.1481922 -0.0765607 0.0182731 -6.1498191 <0.001 ***
week: 51 -0.1277134 -0.1641889 -0.0912378 0.0186098 -6.8627063 <0.001 ***
week: 52 -0.1049366 -0.1415204 -0.0683528 0.0186650 -5.6221106 <0.001 ***
dotw: Tue -0.0069231 -0.0218203 0.0079740 0.0076005 -0.9108804 0.362
dotw: Wed -0.0036954 -0.0187799 0.0113891 0.0076961 -0.4801717 0.631
dotw: Thu 0.0001577 -0.0148931 0.0152084 0.0076789 0.0205341 0.984
dotw: Fri -0.0207695 -0.0358671 -0.0056718 0.0077028 -2.6963617 0.007 **
dotw: Sat -0.0559425 -0.0711165 -0.0407686 0.0077417 -7.2261143 <0.001 ***
dotw: Sun -0.0444422 -0.0599940 -0.0288904 0.0079345 -5.6011374 <0.001 ***
times: fifteen-eighteen 0.0330945 0.0185826 0.0476065 0.0074040 4.4698334 <0.001 ***
times: nine-twelve 0.0068755 -0.0086455 0.0223966 0.0079188 0.8682546 0.385
times: one-three 0.0671557 0.0478257 0.0864857 0.0098621 6.8094590 <0.001 ***
times: six-nine 0.0170044 -0.0003896 0.0343984 0.0088744 1.9161252 0.055 .
times: three-six 0.1149854 0.0926699 0.1373008 0.0113853 10.0994488 <0.001 ***
times: twelve-fifteen 0.0229075 0.0081244 0.0376905 0.0075423 3.0372058 0.002 **
times: twentyone-twentyfour 0.0200573 0.0037328 0.0363817 0.0083287 2.4082102 0.016
holiday: Non-Holiday 0.0639602 0.0336640 0.0942563 0.0154570 4.1379342 <0.001 ***
SnowPresent: Snow 0.2661831 0.2209938 0.3113724 0.0230555 11.5453225 <0.001 ***
HeavyRain: NoHeavyRain 0.0998800 0.0030581 0.1967020 0.0493984 2.0219296 0.043
StrongWind: StrongWind 0.2936691 0.1518332 0.4355050 0.0723644 4.0581991 <0.001 ***

Using the final model, the MAE of the Training set comes out to be 2.598714 and 2.520527 for the testing set. The charts below show that our final underpredicts the response time of the EMS calls on both testing and training sets. The red line represents the one-to-one line and you can see that the model’s predictions fall under the line, meaning underprediction.

#Plotting accuracy metrics
preds.train <- data.frame(pred   = vb_training.reg1.2$ResponseTime.Predict,
                          actual = vb_training.reg1.2$ResponseTime,
                          source = "training data")
preds.test  <- data.frame(pred   = vb_test.reg1.2$ResponseTime.Predict,
                          actual = vb_test.reg1.2$ResponseTime,
                          source = "testing data")
preds <- rbind(preds.train, preds.test)

ggplot(preds, aes(x = actual, y = pred, color = source)) +
  geom_point() +
  scale_fill_manual(values = palette2) +
  geom_smooth(method = "lm", color = "green") +
  geom_abline(color = "red") +
  coord_equal() +
  theme_bw() +
  facet_wrap(~source, nrow = 2) +
  labs(title = "Comparing predictions to actual ResponeTime for the test set and the training set",
       caption="Predicted ResponeTime as a Function of Observed ResponseTime for the Test Set and the Training Set",
       x = "Actual ResponseTime",
       y = "Predicted Response") +
  plotTheme +
  theme(
    legend.position = "none")

#ggplot(preds, aes(x = actual, y = pred)) +
#  geom_point() +
#  geom_smooth(method = "lm", color = "green") +
#  geom_abline(color = "orange") +
#  coord_equal() +
#  theme_bw() +
#  labs(title = "Comparing predictions to actual response times for all calls",
#       caption="Predicted response time as a Function of Observed response time",
#       x = "Actual ResponseTime",
#       y = "Predicted ResponseTime") +
#  plotTheme

5. Cross Validation

To test the generalizability random 100-fold method was conducted with the final model (reg1.2).

fitControl <- trainControl(method = "cv", number = 100, savePredictions = TRUE)
set.seed(825)

reg.cv <- 
  train(log_ResponseTime ~ ., data = st_drop_geometry(vb_training) %>% 
             dplyr::select(log_ResponseTime,
                           CallPriority,
                           CallVolume,
                           ems_station_nn,
                           fire_stations_nn,
                           b_advanced_life_support,
                           b_chief_dispatch,
                           b_special_event,
                           b_air_dispatch,
                           b_water_dispatch,
                           b_mass_casualty_dispatch,
                           b_fire_dispatch,
                           Temperature,
                           Precipitation,
                           ACCESS2_CrudePrev,
                           ARTHRITIS_CrudePrev,
                           BINGE_CrudePrev,
                           BPHIGH_CrudePrev,
                           BPMED_CrudePrev,
                           CANCER_CrudePrev,
                           CHD_CrudePrev,
                           CHECKUP_CrudePrev,
                           CHOLSCREEN_CrudePrev,
                           COLON_SCREEN_CrudePrev,
                           COPD_CrudePrev,
                           COREM_CrudePrev,
                           COREW_CrudePrev,
                           CSMOKING_CrudePrev,
                           DENTAL_CrudePrev,
                           DIABETES_CrudePrev,
                           HIGHCHOL_CrudePrev,
                           LPA_CrudePrev,
                           OBESITY_CrudePrev,
                           PAPTEST_CrudePrev,
                           PHLTH_CrudePrev,
                           STROKE_CrudePrev,
                           TotalPop,
                           MedHHInc,
                           TotalHH,
                           pctWhite,
                           pctBachelors,
                           pctPoverty,
                           pctCarCommute,
                           PopDens,
                           week,
                           dotw,
                           times,
                           holiday,
                           SnowPresent,
                           HeavyRain,
                           StrongWind) %>%
          dplyr::filter(is.infinite(log_ResponseTime) == FALSE), 
        method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv$resample

The plots below show the result of the 100 fold cross validation process. Mean absolute Error (MAE), root-mean-square deviation (RMSE), and Rsquared of the 100 folds seem to cluster except for a few outliers. This indicates that our final model can be generalizable to new data.

reg.cv$resample %>% 
  pivot_longer(-Resample) %>% 
  mutate(name = as.factor(name)) %>% 
  ggplot(., aes(x = name, y = value, color = name)) +
  geom_jitter(width = 0.1) +
  facet_wrap(~name, ncol = 3, scales = "free") +
  theme_bw() +
  theme(legend.position = "none") +
  plotTheme

The normal distribution of the MAE of the 100 fold test also confirms that the model can be generalizable to new data with some errors.

reg.cv.dataframe<-
  reg.cv$resample %>% 
  dplyr::select(MAE) %>%
  data.frame()


#Distribution of Mean Average Error from 100 Fold Test
ggplot(reg.cv.dataframe, aes(x=MAE)) +
  geom_histogram(fill = "darkblue") +
  labs(title="Distribution of Mean Average Error from 100 Fold Test",
       x="Mean Average Error", 
       y="Frequency")+
  plotTheme

6. Conclusion

As discussed in the introduction, the purpose of this project is to amplify the presence of shared information in the event of a medical emergency by providing an estimated time of arrival for emergency medical services (EMS). We have used a series of features, ranging from temporal to spatial to demographic data, to come up with a model that can help predict the response time of an EMS. Incorporating additional features like network distances to EMS stations may improve the model in predicting the response time of EMS calls.